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Abstract 

In the Bayesian stochastic search variable selecti on framework , a common prior distri 



C/3 



> 



["T^l , bution for the regression coefficients is the g-prior of IZellnen [l986j. However, there are two 

standard cases in which the associated covariance matrix does not exist, and the conventional 
prior of Zellner can not be used: if the number of observations is lower than the number 
of variables (large p and small n paradigm) , or if some variables are linear combinations of 
others. In such situations a prior distribution derived from t he prior of ZcUncr ca n be u sed, 



by introducing a ridge parameter. This prior introduced bv lGupta and I brahim is a 



flexible and simple adaptation of the g-prior. In this paper we study the influence of the 
ridge parameter on the selection of variables. A simple way to choose the associated hyper- 
\ parameters is proposed. The method is valid for any generalized linear mixed model and 

' we focus on the case of probit mixed models when some variables are linear combinations of 

, others. The method is applied to both simulated and real datasets obtained from AfFymetrix 

' microarray experiments. Results are compared to those obtained with the Bayesian Lasso. 

(N 

. Keywords: Stochastic Search Variable Selection, Bayesian Lasso, Zellner prior, ridge parame- 

ter, generalized linear mixed model, probit mixed regression model, Metropolis- within- Gibbs 
algorithm. 



1 Introduction 

We consider the problem of Bayesian variable selection in a generalized linear mixed model with 
Y a n-vector of responses, given a set of p potential fixed regressors 

g{E{Yi\U,(3)) = Xf(3 + ZfU, 

where g stands for the link function associated to the model, and Xj and Zi for the fixed and 
random effect regressors associated to the ith observation. The parameter f3 £MP corresponds 
to the fixed-effect coefficients and the parameter U to the random-effect coefficients. X and Z 
are known design matrices associated with the fixed and random effects. We consider K random 
effects, U = {Uj, • • • , U]^)^ where each Ui is a vector of size qi, and J2h=i H — I- 

In a stochastic search variable selection (SSVS) framework, it is convenient to denote by 
7 the vector of latent variables indicating if a variable is selected or not; that is, 7j = 1 if 
f3j 7^ and 7j = if f3j = 0. We then denote by f3^ the vector of all non-zero elements of /? 
and by the design matrix with columns corresponding to the elements of 7 that are equal to 1. 
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To complete the model, a conventional prior distribution for j3^\^ is a d^-dimensional Gaus- 



sian distribution, with = Yjj=i Ij^ 



^1) 



(1) 



Concerning the prior covariance matrix S^, an attractive and standard choice is 

^7 = ''"(^7^7)^"'"- 



Equations ([T]) and ([2]) correspond to the 5-prior distribution, proposed bv IZellneri 1986l | in the 
case of standard linear models. This prior replicates the covariance structure of the design and 
enables an automatic scaling based on the data. Up to the scalar r, the prior covariance matrix is 



related to the Fisher Information Matrix in the linear model [see for instance Chen and Ibrahim 



m 



2003] . Moreover, it leads to simple expressions of the marginal likelihood, and as pointed out 



George and Foster 2000l |. the marginal likelihood becomes a function of both R-square and 



the number of covariates like in A IC or BIC criteria. The para, meter r > is referred to as 



the variable selection coefficient in lBottolo and RichardsonI 20101 ]. In the homoscedastic linear 



model with variance <t^, this parameter can be expressed as r = ga'^ . Therefore this prior 
has been us ed by many authors in the ca se of linear models, but also for generalized linear 



models (see ISabanes Bove and Heidi 201ll |). In case of a binary response variable, this prior 



Lee et al.l. 200^1. 


Sha et al.. 


2004. 


Zhou et al.. 2004. 



The choice of the var iable selection coefficient r can have a great influence on the vari 



able selection process [see iGeorge and Fosterl . l2000l | and h as been considered by m any authors 



Some of them considered a fixed value for r. For instance ISmith and KohnI 1997] suggested t o 
choose T between 10 and 100. Another approach is the approach of George and Foster 200d ]. 
who developed empirical Bayes methods based on the estimation of r from its marginal like- 
lihood. Other authors proposed to put a hyper-prior distribution on r, like IZellner and Slow 



[198d] that used an inverse-gamma distribution X^(l/2, n/2). But under the Zellner-Siow prior, 
marginal likelihoods are not a vailable in closed forms, and approximations are necessary [see 
Bottolo and Richardson . l20ld ]. 

Not e also that the Zel lner-Siow prior can be seen as a mixture of 5-priors. Following this 



remark, Liang et al. 20081 ] proposed a new family of priors on r, the hyper-g prior family which 



leads to new mixtures of ^(-priors: the marginal likelihoods are then in closed forms, but not 
in a p ractical way because h ypergeometric functions are used. Independently but in the same 
spirit, Cui and George 20081 ] suggested to put an inverse-gamma prior distribution on (1 + r) 
(rather than on r like Zellner and Slo w), obtaining a family of priors on r which contains the 



hyper-g prior family as a specia l case. iBottolo and RichardsonI |2010l] used a sim ilar prior. In 



the linear regression framework, Celeux et al. 20061 ] and" Marin and Robert 2007 ] suggested an 



improper discrete prior on r. B ut this prior is diffic ult to use in practice because it induces an 
infii iite sum. As a consequenc e, ICeleux et al.l 20111 ] proposed a Jeffrey prior continuous on r, 
and IGuo and SpeckmanI |2009l ] showed the consistence of associated Bayes factors. 

In spite of the variety of all these works to choose the variable selection coefficient r, a crucial 
problem remains with priors using the matrix (X!^X^)^^. Indeed, X!^X^ should be invertible. 
However, there are two standard cases where X!^X^ is singular: 

• If the number of observations is lower than the number of variables in the model, n < d^. 

• If some variables are linear combinations of others. In practice, even if X!^X^ is theoreti- 
cally invertible, some variables can be highly correlated and X^X-y can be computation- 
ally singular. It is often the case in genomic high-dimensional datasets for example. This 
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problem can also be encountered when several datasets are merged: some variables can 
be collinear or almost collinear if same variables were present into several datasets under 
different labels for instance. 

In these cases the classic al q-prior does not work . Conc erning the first case, several authors 



proposed alternative priors. iMaruvama and Georgd 201 1[ | proposed a generalization of the g- 



prior, working with a singular value decomposit ion of the design matr ix X. But their approach 
is valid only in case of classical linear models. Yang and Song 201Clll propose d to replace the 



m 



S-y by its Moore Penrose's inverse (see alsc West 20031 ] 1. However, the 



matrix (X;^v».^y 

computation of the po sterior distribution has a techni cal issue that do not permit the use of 
MCMC algorithm (see iBaragatti and Pommeretl [201 1[ | ) . Another idea woul d be to avoid thi s 



2011]. 



first case by fixing the number of selected covariates at each iteration, as in iBaragatt 
It appeared computationally advantageous and it reduced the effect of the variable selection 
coefficient r used in the g-pnor. But the number of selected variables at each iteration must 
be arbitrarily fixed. Moreover, fixing the nur nber of selected covariates is n ot a solution for the 
second CclSG, clS well as the priors propos ed bvlMaruyama and Georgd [201 II] andlYang and Sona 
' 2OIOI ]. In a spirit of ridge regression (see lMarquardtl [l970l ]). ^upta and IbrahimI [20071 ] proposed 



an extension of the g-puov, by introducing a ridge parameter. Their prior can be used in the two 
cases, but they did not study th e secon d case in which some variables are linear combinations 



of others. Recently iKwon et al.l 20111 ] proposed a variable selection method which take into 



account high correlations between predictors, but again their approach is not valid when some 
variables are linear combinations of others. More generally, to our knowledge the problem 
of variable selection when some variables are linear combinations of others is not present in 
literature. 



In this paper we develop the idea of iGupta and IbrahimI 20071 ] concerning the introduction 
of a ridge parameter, and we study the influence of this parameter on the selection of variables. 
Besides , we su g gest a way to choose the associated hyper-parameters: following the original 
idea of Izellnei] [1986^ which is to keep the covariance structure of the design, we propose to 
keep the tot a.1 variance of th e data through the trac e of X-^X. We focu sed on probit models, 
as studied in IBaragatti 2011 ]. Yang and Song 20ld ] and Lee et al. 2003]. The aim is to study 
the behavior of the variable selection process while using the proposed prior, especially when 
some variables are linear combinations of others. The approach developed is applied both to 
simulation data, and to data obtained from Affymetrix microarray experime nts. We compare 



the nu mer ical results w ith those obtained by a Bayesian Lasso approach (see iPark and Casella 
20081 ] and Hand 2009 ] for a recent review) in the context of probit mixed models. 



This paper is organized as follows. In Section [2] the extension of the prior to be used for 
is introduced and a choice for the hyper-parameters is suggested. The Section [3] outlines the 
priors, full conditional distributions and the sampler to be used in case of a probit mixed model, 
for both the SSVS approach and a Bayesian Lasso approach. In Section H] experimental results 
are given and a sensitivity analysis is performed. Finally Section [5] discusses the method. 



2 Introducing a ridge parameter 



2.1 Prior distribution of /3 with a ridge parameter 



As previousl y explained, in the case of s ingularity of the matrix XjTX^, the classical g'-prior can 



7 ^^7) 



not be used. iGupta and IbrahimI [20071 ] proposed to use a ridge par ameter, denoted A > 0, by 



replacing in ([2]) the matrix r ^X!^X^ by r ^ (X!^X^ -|- A/) . Imitating iGupta and IbrahimI [20071 ] 
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we write 



and we consider the prior 



;0, (r-iX?;X^ + XI) 



with 



p 



(3) 



(4) 



Since A is strictly positive, the matrix S^(A) is always of full rank and @ can be viewed as 
a modified form of the 5-prior, which is a compromise between independence and instability. 
Indeed, for large values of A and r, S^(A) is close to a diagonal matrix that coincides with the 
conditional independent case. On the opposite, for small values of A and r, the term r~^X!^X^ 
prevails and the inverse of r~^X!^X^ + XI will be instable if X^X^ is singular. In that case, the 
prior distribution ([3]) is close to the (7-prior case. 



2.2 Calibrating hyper-parameters 

Following [Zelhierl [l98d |. our purpose is to use the design to calibrate the covariance of /3-y with 



ro(X;^X^) , with tq the fixed hyper- parameter used in 



a ridge parameter. Write S^(0) 
this classical prior. Using 5]^(A) instead of 5]^(0) amounts to introducing a perturbation in the 
classical g-pnoi. An interesting feature of the classical g(-prior is that the variance-covariance 
structure of the data is preserved. The ridge parameter prevents us to strictly preserve this 
structure. However, it is possible to replicate the total variance of the data, which corresponds, 
up to a normalization, to the trace of S-y(O)^^. The constraint used is then 



tr(s^(0)-i) =tr(s^(A)-i), 



which yields 



To 



1 + 



XpTo 



with the condition Xpro 7^ tr(X!^X^) 



tr(X.l^X^) - XpTQ- 

Concerning the choice of A, in order to take into account 

we suggest to take 



the number p of covariates and to reduce the effect of the ridge factor 
X = 1/p, getting 

, , ^0 

To 



with the condition tq 7^ tr(X!^X^) 



^ ~^ ir(X;^X^) - To- 
The vector 7 can be different between two iterations of the 



algorithm. Therefore we propose to use the complete design matrix X instead of X^, yielding 

To 



To 



1 



tr(X^X)-ro 



(5) 



with Tq 7^ tr(X"^X). In practice, the user has to choose on l y the parameter tq, as A and r are 
then obtained by 1/p and (0). Following ISmith and KohnI [19971], tq could be chosen between 



10 and 100, and not too close to tr(X^X). It is of interest to study the influence of the hyper- 
parameters A and r. In Section 14.31 it will be show that these hyper-parameters do not have a 
large bearing on the results. 

Remark 1 The choice X = 1/p has the advantage to be automatic and to reduce the influence of 
the ridge parameter when the number of variables is large. However it can lead to computational 
instability if this number is too large and hence X too small, since S^(A) is then almost singular. 
In our numerical study we did not encountered this problem for p around 300. But for very large 
p we could add a threshold e and then choose X = max(l/p, e). 
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3 Illustration trough a mixed probit model 
3.1 The probit mixed model 

We consider the problem of variable selection among a set of p potential fixed regressors, in the 
following probit mixed model 

P{Yi = l\U,(3)=pi = <^(Xf /3 + Zf [/), 



where $ st ands for the stan dard Gaussian cumulative distribution function. Following [Albert and Chib 



I993I ] and Lee et al. 20031 ] . a vector of latent variables L = [Li, . . . , Ln)^ is introduced, and we 



assume that the conditional distribution of L is Gaussian, that is L | C/, /3 ~ J\fn{Xj3 + ZU,In), 
with In the identity matrix. We then have 

/ 1 if L, > , , 

^^ = 1 ifL,<0. 

3.2 Stochastic Search Variable Selection 
Prior and full conditional distributions 

We used the following prior distributions, which are classical except the one for /3-y: 

• As explained in Section [2l we use the prior (jH) for /3-y . 

• The 7j are assumed to be independent Bernoulli variables, with 

P{jj = 1) = TT, < vr < 1, (7) 
as we do not want to use prior knowledge to favor any variables. 

• The vector of coefficients associated with the random effects is assumed to be Gaussian 
and centered, with covariance matrix D: 

U\Dr^Mg{0,D). (8) 

We will consider the case where D is a diagonal matrix D = diag{Ai, . . . , Ak), where 
Al = (rflq^, I = 1, . . . , K and Iq^ the identity matrix. The prior distributions for the af are 
then Inverse Gamma IQamma{a,b) {b denoting the scale parameter). In a more general 
case, if no structure is assumed for the variance-covariance matrix D, its prior distribution 
should be an Inverse-Wishart. 

Most of the full conditional distributions did not depend on the ridge parameter. In particular: 

• The full conditional distribution of L is given by (see lAlbert and GhibI jl993l |): 



Li\l3, U,Yi = l ~ M{Xj(5 + ZjU, 1) left truncated at (9) 
Li\(5, [/, = ~ M{Xj(5 + ZjU, 1) right truncated at 0. 

Defining W = [Z^^ Z + Z)^^)^^, the full conditional distribution of U is: 

C/|L,/3,Z) ~AAg(VFZ^(L-X/3),VF). (10) 
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The full conditional distribution of the af ,1 = I, . . . , K are Inverse-Gamma: 



af I Ui ~ ZQamma{^ + a, {-^UfUi + b 

Only the full conditional distributions of and 7 depend on A, as follows: 
• For /3^: 



(11) 



l5^\L,U,-f^Md-,{V^lC^{L-ZU),V^), with 
• And for 7: 

(27r)-^ r 1 



^^^X^X, + A/ 



(12) 



/(7|L,[7,/3^) oc 



S,(A)|V2 



exp 



(/3^F- - (L - Z[/)^X^/3-^ - /3^X^(L - ZU)) 



p 



(13) 



The sampler 

The posterior distribution of 7 is of particular interest for the variable selection problem. An idea 
is to use a Gibbs sampler to explore the full posterior distribution and to search for high probabil- 
ity 7 values. Simulations from all the full conditional distributions can be easily obtained, except 
for 7 which full conditional distribution does not correspond to a standard multivariate one. The 
7 vector can be simulated either element by element, or by using a Metropolis-Hastings algo- 
rithm. In general, in the case of a high number of variables, the Metropolis-Hastings algorithm is 
computationally advantageous. Moreover, using a Metropolis-Hastings ste p in a Gibbs sampler 
improves the sampler in terms of variance, see iRobert and Casellal 2004i |. As a consequence, 
we decided to use a Metropolis-within-Gibbs algorithm. But even with a Metropolis-Hastings 
algorithm, the full conditional distrib ution of ^ canno t be directly simulated, since it depends 
on the ac tual value of (3^. Following Lee et al. 20031 ] we then used the grouping technique of 
Liul 199J], by considering the parameters 7 and jointly. The advantage of this technique is 
that the conve rgence of the Markov ch ain is improved, and autocorrelations are reduced, see 
Liul 1994( 1 and Ivan Dvk and ParkI 200811. Using this tec hnique is equivalent to integrate the full 



conditional distribution of 7 in (see 



Baragattil [201 ll ] for more details). We then obtain: 



fi7\L,U) oc 



|S,(A)|V2 



exp 



{L - ZUf{I - X^y^X;^)(L - ZU) 



Note that setting A = 0, we can recover the formula corresponding to the classical g-pnoi. 



(14) 



Remark 2 The influence of r appears here through the ratio R^^"^ 
that 

' if r — )■ 00, 
if T ^ 0, R^l, 
if A — )• 00, i? — )• 1, 

t_}_\d-,/2 

+ • 



\V-,\ 



1/2 



We can see 



if A ^ 0, 



R —7- I -^Xj^X^ + /| , 



R 
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The Metropolis-Hastings algorithm used to generate the 7 vector can be summarized as 
follows: at iteration (i+1) a candidate 7* is proposed from 7*^*^ and using a symmetric transition 
kernel the acceptance rate is 



with 

f{r\L,u) 

/(7«|L,C/) 



\ 1/2 

- j exp { - -(L - zc/)^(x^.v;(,)X5,) - X7.y7,x^o(i - zu)} 



fil*\L,U) 
/(7»|L,t/) 



n I 



i=i 



if VjG {!,..., p} vrj=7r. 



(15) 



The simplest way to have a symetric transition kernel is to prop ose a 7* which corresp onds 



to 7^'^ in which r cornponen ts have been randomly changed (see IChipman et alJ [200 1[] and 
George and McCullochI [19971 ] 



Remark 3 The influence of t appears via the ratio Q^^"^ 



S^* V I 



1/2 



that satisfies: 



f if T ^ 00, IX^.X^. + A/| X |X;^,X^» + A/|-i, 

if T ^ 0, Q^l, 

if A — )• 00, Q — ?• 1, 

t if A ^ 0, Q^l. 



Post-processing 

The number of iterations of the algorithm is b+m, where b corresponds to the burn-in period and 
m to the observations from the posterior distributions. For selection of variables, the sequence 
{7^*^ = (7^*^ . . . ,7p*^),t = 6+1, ... , b+m} is used. The most relevant variables for the regression 
model are those which are supported by the data and prior information. Thus they are those 
corresponding to the 7 components with higher posterior probabilities, and can be identified 
as the 7 components that are most often equal to 1. To decide which variables should be 
finally selected after a run, a confidence interval based on a Poisson distribution could be used. 
However we noticed that usually a reasonable number of relevant variables can be isolated from 
the others using the number of selections. Therefore we suggest to use a box-plot of the number 
of iterations during which variables were selected. For each run the variables distinguishable 
from the others can be selected by fixing a threshold: if a variable has been selected during a 
number of iterations which is higher than this threshold, then the variable is kept in the final 
selection. 



3.3 Bayesian Lasso for probit mixed models 

A competing paradigm to the classical SSVS approach is the Bayesian Lasso framework 



Park and Casella 20081 ] ). which is inspired from the frequentist Lasso ( Tibshirani 1996| ]). 



(see 
In 



order to compare the two frameworks, we adapted the Bayesi an Lasso to prob i t mix ed models. 
The Bayesian Las so has alreadv been used for probit models feae and MailiclJ ^) , and also 



for mixed models (jLegarra et al.l 20111 ]). Combining these two approaches, we considered a fully 



Bayesian analysis with the following prior distributions: 
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For each /3j,j = we consider a Laplace prior: Caplace{0,l/ V^). This Laplace 

distribution can be expressed as a scale mixture of normal distributio ns with independent 
exponentially distributed variances, see Andrews and Mallows! [l974j . This prior is then 
equivalent to f3j \ Xj ~ M{0,Xj) and Xj ~ £xpo{5/2). Writing A = diag{Xi, . . . ,Xp), we 
have /3 \ A ~7Vp(0,A). 



Concerning the random effects U and the variance covariance matrix D, we use the same 
classical priors than in the SSVS approach. 



• Following iPark and Casellal 2008l | , a hyperprior distribution is put on the Lasso parameter 
5: 5 ~ Gammaje, f), f denoting the scale parameter. O n the following experimental 
results, we found like lYi and Xul 2008f | and iLi et al.l 201 ll ] that the posteriors were not 
too sensitive to the hyperparameters e and /, as long as they were small enough so that 
the hyperprior is sufficiently flat. In practice we used e = / = 1, but results were similar 
with e = / = 10 for instance. 

The Bayesian Lasso estimates for the (3j are then obtained by a Gibbs sampler using the following 
posterior distributions: 

• For L, U and the af,l = 1,...,K, the full conditional distributions are the same than 
those in the SSVS method, using (|9l). (fTO]l and (fTTI). 

• For the /3, the posterior is: 



/3\L, [/, A ~ Mp{VaX^ (L - ZU), Va) with Va 



X^X + A 



-ii-i 



(16) 



The posterior distributions for the 1/Xj,j = 1, . . . ,p are inverse Gaussian: 

i — 



• The posterior for the Lasso parameter 5 is a gamma distribution: 



(17) 



(5 I A ~ Qammai^ + e, ( 



(18) 



Post-processing 

From the results of the Bayesian Lasso we obtain posterior estimates for the fijS and the AjS, 
and the variables can be selected by three different ways: 

1. On e can select the var iab les correspondi ng to an absolute value \ \ higher than a threshold, 
like lYi and Xul |2008l | or IlI et al.l |201ll ] for instance. 



2. iBae and Mallickl 2004l | among others considered that the variables associated to /3jS with 
smaller posterior variances have no effect and should be excluded from the model. There- 
fore, they proposed to select variables corresponding to high values of Xj. 

3. Finally, the results of the Lasso enable us to obtain posterior credible intervals (CI) for 
the (3jS. Hence we c an select variables corresponding to a (3j with a posterior CI which 
does not cover 0, see Kyung et al. 20ld ] for instance. 
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4 Experimental results 



4.1 Simulated data 

We simulated 200 binary observations and 300 variables, the observations being obtained using 
a probit mixed model with 5 of these variables and one random effect of length 4. Among the 
300 variables, 280 were generated from a uniform on [—5, 5] and denoted by VI, . . . , V280. Then 
10 variables denoted by ^281, . . . , 1/290 were build to be collinear to the first 10 variables, with 
a factor 2: for instance V282 = 2 x V2. One variable was build to be a linear combination of 
VI and V2 (1/291 = VI + V2), and another was build to be a linear combination of V3 and 
V4: {V292 = V3 — V4:). Finally, 8 variables were build to be linear combinations of variables 
5 to 12 and variables 13 to 20 (for instance V293 = V5 + 1/13). The five variables used to 
generate the binary observations were the first five: 1/1,1/2,1/3,1/4 and V5. The vector of 
coefficients associated with these variables was /3 = (1,-1,2,-2,3). The first 100 observations 
were part of the training set, and the last 100 were part of the validation set. In the training 
and the validation sets, 25 observations were associated with each component of the random 
effect, whom vector of coefficients was U = (—3, —2,2,3). We had only one random effect and 
the different components were supposed independent, hence we put D = a^I^. 



SSVS approach using the prior with a ridge parameter 

The objective was to assess the behavior of the proposed method when some variables are linear 
combinations of others, and to compare it to the case where no variable is linear combination 
of others. Therefore we performed 10 runs of the sampler using only the first 280 variables, 
and 10 runs using the 300 variables. In these two cases and for each run the same parameters 
were used: 5 variables were initially selected, one component of 7 was proposed to be changed 
at each iteration of the Metropolis-Hastings step, the prior of cr^ was a iTj = 5/280 

for all j when 280 variables were kept, vTj = 5/300 for all j when 300 variables were kept, 4000 
iterations were performed after a burn-in period of 1000 iterations, and each Metropolis-Hastings 
step consisted of 500 it erations. We decided to choose tq = 50, which is a standard choice, see 
Smith and KohnI |l997l ] for instance. The parameters A and r were then chosen as explained 



in [221 yielding A = 1/280 and r = 50.01075 when using 280 variables, and A = 1/300 and 
r = 50.00885 when using 300 variables. 

A final selection was performed for each of the 20 runs. Figure [T] presents two boxplots 
associated to 280 variables and 300 variables, respectively. For the run with 280 variables there 
is a gap between the variables V2, V3, VA and V5 and the others, hence we selected these four 
variables. For the run with 300 variables, there is a gap between the variables selected in more 
than 400 iterations and the others, hence we selected the eight corresponding variables. 

Table [U gives the variables kept in the final selections of the 20 runs. Among the runs with 
the first 280 variables, 3 among the 5 variables used to generate the data were in the final 
selection of almost all runs, and the variables 1/4 was in the final selection of half of the runs. 
Notice that VI was in none of the final selections. Among the runs with 300 variables, the 
variables V1,V2,V3 and V5 were present in most of the final selections, directly or indirectly 
through linear combinations. Contrarily to the runs with 280 variables, the variables 1/4 or 
1/284 were in none of the final selections, while the variables VI and 1/281 were in all the final 
selections. Concerning 1/4, it was indirectly in all the final selections through 1/292, which is a 
linear combination of V3 and 1/4. Eventually, the final selections of the runs with 300 variables 
appeared as relevant as the final selections of the runs with 280 variables, despite the fact that 
some variables were linear combinations of others. 
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run 5, 280 variables 



run 6, 300 variables 



O (M 



E 





Figure 1: Boxplots of the number of selections of a variable after the burn-in period. Each point 

represents a variable (or several variables if superposed) . The left boxplot corresponds to the run 5 with 
280 variables. The right boxplot corresponds to the run 6 with 300 variables. 

Remark 4 We obtained similar results with only 500 hum-in iterations and 500 post burn-in 
iterations, except that the variable VA was in none of the final selections. 



Variables 


Number of selections 


Number of selections 




among the 10 runs 


among the 10 runs 




with 280 variables 


with 300 variables 


VI 





10 


V2 


9 


8 


V3 


10 


2 


V4 


5 





V5 


10 


10 


V281 = 2x1/1 




10 


V282 = 2 xV2 




9 


V283 = 2 xV3 




3 


V284 = 2 xVA 







V285 = 2 x 1/5 


Not available 


10 


V291 = Vl + V2 




7 


V292 = V3-V4 
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Tabic 1: Number of final selections among the 10 runs with the first 280 variables and among the 10 
runs with 300 variables, for the variables VI, V2, VS, VA, Vh and linear combinations of these variables. 
No other variable was present in the final selections. 



To assess the relevance of the final selections, predictions were performed. Concerning the 
runs with 300 variables, we can not fit a model with all the variables in final selections, because 
some of them are linear combinations of others. In that case we decided to fit a probit mixed 
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model on the training set with the five linearly independent variables V281, 1^282, V283, V285 
and V292. Sensitivity and specificity results are presented in Table [2j For comparison, us- 
ing the five variables used to generate the data, we obtained a sensitivity and a specificity 
equal to 0.94 and 0.89. This is equivalent to results obtained using the selected variables 
V281, V282, V283, V285 and ^292. 



Variables selected among 280 


Variables selected among 300 


Variables 


Sensitivity 


Specificity 


Variables 


Sensitivity 


Specificity 


V2,V3,V5 


0.87 


0.89 


y281, V282 
V283, V285 


0.94 


0.89 


V2,V3,V4,V5 


0.93 


0.96 


and ^292 







Table 2: Sensitivity and specificity on the validation dataset. 



The number of components of 7 equal to 1 can vary from one iteration to another. Figure [2] 
shows, for the 10 runs with 300 variables, the number of iterations of the runs associated with 
a number of selected variables from 1 to 15. Similar results were obtained for the 10 runs with 
the first 280 variables. 

With 300 variables (non full rank matrix) 



6 7 8 9 1 11 
Number of selected variables 



13 14 15 



Figure 2: Number of iterations of the runs associated with a number of selected variables from 1 to 15. 
For the 10 runs, there were a total of 40000 post burn-in iterations. 



Bayesian Lasso approach 

Ten runs of the Bayesian Lasso were performed, with 5000 burn-in iterations and 15000 post- 
burn- in iterations. The results of the 7th run are illustrated in Figure [3l 
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Run 7 of the Lasso 



Run 7 of the Lasso 



Run 7 of the Lasso 



"* -r 



0) ^ 

.9 



,v2ei 

•V292, V25 



SV139.V172, V282 



50 100 150 200 250 300 

Variables 




Variables 



100 150 200 250 300 
Vartabfes 



Figure 3: Results of the 7th run of the Bayesian Lasso. On the left are represented the values of the /SjS 
as well as the threshold used. Another way to represent these values is by using a boxplot, represented 
in the middle. On the right are represented the values of the XjS and the threshold used. 

From this run, using the absolute values of the /3jS and a treshold of 0.85, the relevant vari- 
ables 1^281, ^282, V283, V285 and 1^292 were selected (^284 was indirectly selected by V292). 
However, we can note that the other relevant variables were not selected, and in particular the 
variables which are linear combinations of the selected variables. Moreover, the non-relevant 
variable V25 was selected. Using the values of the AjS and a threshold of 1, the relevant vari- 
ables y281, y283, y285 and 1^292 were selected, and the non-relevant variable V25 was still 
selected. Finally, using posterior CI, only the variables 1^283 and 1^285 were selected. 
The variables selected during the ten runs are given in Table [31 



Variables 


Using the \/3j\s 


Using the XjS 


Using the posterior CI 


selected in 9 runs 


V285 


V285 


F285 


selected in 7 runs 


V283, V292 


V283 




selected in 6 runs 




V292 


F283 


selected in 3 runs 


V282 






selected in 2 runs 


V281 


y281,y282 




selected in 1 runs 


V25, V208, V209, V250 
F87,yi27,y270 


F25, 1/250, 1/87, F127 
F270 


F250,y87,yi27 



Table 3: Variables selected during the 10 runs of the Bayesian Lasso approach, using three different 
methods, see 13.31 



Generally, it appeared that using the values of the /3jS and the A^s enabled us to select more 
relevant variables than using the posterior CI, but at the price of also selecting non-relevant 
variables. Besides, we can note that this Bayesian Lasso approach did not give very stable 
results. Indeed, some runs gave relevant selections of variables, like the run 1 for instance from 
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which the variables 1^281, ^282, 1^283, y285 and 1^292 were selected, while other runs gave 
selections of variables quite less relevant, like the run 10 from which the variables V127 and 
1^270 were selected, or like the run 4 from which the variables V285, 1^208, 1^209 and 1^250 were 
selected. Between these two cases, some runs enabled us to select some relevant variables, but 
not all of them, like the runs 2, 5, 8 and 9 from which the variables ^283, V285 and V292 were 
selected. Eventually, it appeared that the subsets of selected variables obtained were less stable 
than those obtained by the SSVS approach using the prior with a ridge parameter, and that 
more "noise" was observed since several non-relevant variables which were selected in only one 
of the ten runs. 



4.2 Illustrations through real data 

As an illustration, Affym etrix micr o array experiment results from patients with breast cancer 



were used. Data used in iBaragattil 201 1[ | were considered, see there for more details. Briefly, 
the patients come from three different hospitals, and the objective was to select some variables 
(probesets) which are indicative of the activity of the estrogen receptor (ER) gene in breast 
cancer. The hospital was considered as a random effect in the model, thus accounting for the 
different experimental conditions between the three hospitals. For each patient, the expressions 
of 275 probesets were kept, among which some were known to be relevant to explain the ER 
status (corresponding to variables 148, 260, 263 and 273). We used a training set made of 100 
patients, and a validation set of 88 patients. In order to have a potentially singular X!^X^ matrix, 
we added three variables to the data matrix X. These variables were linear combinations of the 
known relevant variables, hence X was no more of full rank: V276 = 2 x yi48, ^277 = —V260 
and V278 = ^263 + V273. We had only one random effect, which corresponded to the different 
hospitals. The hospitals are supposed independent, hence we put D = a^Is. 



SSVS approach using the prior with a ridge parameter 

We performed 10 runs of the sampler using only the first 275 variables, and 10 runs using all 
the 278 variables. In these two cases and for each run the same 100 patients and the same 
parameters were used. As in the previous illustration we chose tq = 50. The parameters A and 
r were chosen as explained in l2.2l yielding A = 1/275 and c = 50.0009 when using 275 variables, 
and A = 1/278 and c = 50.00088 when using 278 variables. 

Figure H] presents a boxplot of a run with 275 variables and a boxplot of a run with 278 
variables. Following the same reasoning as in the previous example, six variables were selected 
from the left boxplot and three from the right boxplot. 

Table [4] gives the variables kept in the final selections of the 20 runs. As in the previous exam- 
ple, the final selections of the runs with 278 variables appeared as relevant as the final selections 
of the runs with 275 variables, despite the fact that some variables were linear combinations of 
others. 
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Figure 4: Boxplots of the number of selections of a variable after the burn-in period. The left boxplot 
corresponds to the run 3 with 275 variables and the right boxplot corresponds to the run 8 with 278 
variables. 



Variables 


Corresponding 


Number of selections 


Number of selections 




probesets 


among the 10 runs 


among the 10 runs 






with 275 variables 


with 278 variables 


V260 


228241_at 


10 


3 


V273 


205862_at 


9 





VU8 


209604_s_at 


5 


1 


V263 


228554_at 


10 





V83 


203628_at 


7 





V66 


202088_at 


1 





V212 


215157_x_at 


1 





V277 = -V260 


collinearity 


Not available 


3 


V278 = V2Q3 + ^273 


linear combination 


10 



Table 4: Number of final selections among the 10 runs with the first 275 variables and among the 10 
runs with 278 variables, for the different variables and linear combinations. No other variable was present 
in the final selections. 



Predictions were also performed. Table [5] contains sensitivity and specificity results. For 
comparison, using the four relevant variables ^260, ^263, yi48 and V273, we obtained a sen- 
sitivity equal to 0.94 and a specificity equal to 1. This is equivalent to results obtained using 
only the two variables V278 and ^277. 
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Variables selected among 275 


Variables selected among 278 


Variables 


Sensitivity 


Specificity 


Variables 


Sensitivity 


Specificity 


V260, V273, V263 


0.92 


1 


V278 
V278, V277 


0.87 
0.94 


0.97 
1 



Table 5: Sensitivity and specificity on the validation dataset. 



Bayesian Lasso approach 

Ten runs of the Bayesian Lasso were performed, with 5000 burn-in iterations and 15000 post- 
burn-in iterations. The results of the 5th run are illustrated in Figure [5j 



Run S of the Lasso Run 5 of the Lasso Run 5 of the Lasso 




50 100 150 200 250 50 100 150 200 250 

Variabies Variables Variabfss 



Figure 5: Results of the 5th run of the Bayesian Lasso. On the left are represented the values of the /3jS 
as well as the threshold used. Another way to represent these values is by using a boxplot, represented 
in the middle. On the right are represented the values of the XjS and the threshold used. 

From this run, using the absolute values of the /3jS and a treshold of 2.5, the relevant variables 
1^260, 1^278 and 1^277 were selected. Moreover, the less relevant variable F137, yi59, V1A7, 
V271 were selected. Using the values of the AjS and a threshold of 7, the relevant variables 
1/260, ^278 and V277 were selected, and the less relevant variables VW3, V137, V159, VU7 
and V271 were selected. Finally, using posterior CI, no variable was selected. 
The variables selected during the ten runs are given in Table [6l 
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Variables 


Using the \/3j\s 


Using the XjS 


Using the posterior CI 


selected in 7 runs 


V Zio 






selected in 4 runs 


V277, V260 






selected in 3 runs 




F260, y263, F277 




selected in 2 runs 


y263,yii4,yi4o,y272 
yi47,y27i,y83 


yi4o,yi45,y83 


y278, yi45, 1/271, 1/266 


selected in 1 runs 


V2ib, VoU, 1/145, V71 
F102,yi37,yi59,y59 
l/84,yi05,F78,y95 

yi6i,y266,yi05,y2 

yi61, 1/266, yio5,y2 


Kll4, V/8U, V2b2, V71 
F78,y215,yi65,yi02 
F60,yi37,yi59,yi47 

y27i,yi03,y59,y84 

F106, V2, V105, 1/266, yi61 


f 1 1 1 A J r 1 An T /'0'70 J r 1 A'l 

VWA, 1/14U, V 272, v/147 

U84,U83,U95, 1/115 
yi61,F105,F272,y276 



Table 6: Variables selected during the 10 runs of the Bayesian Lasso approach, using three different 
methods, see 13.31 



The same general remarks than those made from the simulated example can be done (see 

Remark 5 We compared the SSVS approach using the prior with a ridge parameter with the 
Bayesian Lasso. Note that in the case of s maller prob lems, the marginal likelihoods of all the 
models can he calculated using the method of lchiA hood j. and it is then possible to confront them 



to the results obtained by the SSVS or the Bayesian Lasso approaches. 
4.3 Sensitivity analysis 

Concerning the variable selection coefficie nt r, the rnethod of variable selection without the ridge 
parameter is not sensitive to its value (seeE^^ ^) , but it is mainly due to the fact that 



the number of variables selected at each iteration of this algorithm was fixed. It is no more the 
case for the algorithm proposed in this paper, hence it seems necessary to assess its sensitivity 
to this parameter. Therefore we studied the infiuences of r and A when they are chosen as 
proposed in Section 12.21 or arbitrarily. We also looked at the behavior of the algorithm when 
the value of the ttj, the prior distribution parameters of cr^ and the number of iterations vary. 
For this sensitivity study we used the example with simulated data (Section 14. ip with 300 vari- 
ables. The different values of the parameters are presented in Table [71 In this table, the number 
of relevant variables in the final selections of the runs are given, the twelve relevant variables 
being VI, V2, V3, V4, V5, V281, V282, V283, V284, V285, V291 and V292. The sensitiv i ty wa s 
assessed by using the relative weighted consistency measure of Somol and Novovicova 2008f |. 



denoted by CWrei- It is a measure evaluating how much subsets of selected variables for several 
runs overlap, and it shows the relative amount of randomness inherent in the concrete variable 
selection process. It takes values between and 1, where represents the outcome of completely 
random occurrence of variables in the selected subsets and 1 indicates the most stable variable 
selection outcome possible. 
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^^(1,1) 




8 




20 


100 


100.0354 dSD 


l/p=l/300 


5/300 


Xg(2,5) 


4000 (1000) 


8 


1 


21 










Xg(5,2) 
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500 (500) 


8 




23 


100 


100.0354 dSD 


l/p=l/300 


5/300 


Xa(l,l) 


4000 (1000) 


8 


1 


24 












40000 (10000) 


8 





Table 7: Parameters of the runs for the sensitivity study and associated relative weighted consistency 
measure of Somol and Novovicova CWrei- 



The algorithm was generally not sensitive to the values of the hyper-parameters, since most 
of the relevant variables were usually selected. The boxplots obtained were often similar to the 
right boxplot of Figure [H In particular, the algorithm was not overly sensitive to the values 
of T and A. There was only one run (the 13th) where no variable could be really distinguished 
from others, and none of the top-ranked variables was a relevant one, see Figure [6l This run 
corresponds to a large r and a small A. The runs 17 and 18 are also noticeable, as all relevant 
variables were finally selected, see Figure [6j They correspond to high values of tTj, and the cost 
for these relevant runs was longer computational times. Eventually, we observed that the values 
of r and ttj play a role in the number of variables selected at each iteration of the algorithm. 
The value of r modified the distribution of this number, see Figure [71 Besides, this number 
increased with the value of ttj, see Figure [H However, even if the number of variables selected 
at each iteration of the algorithm was high, it did not influence the final selections of the runs, 
and it did not influence the number of variables which were distinguishable from others. 
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Figure 6: Boxplot of the number of selections of a variable after the burn-in period, for two runs with 
300 variables. 
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Figure 7: Number of iterations of the runs 1,4 and 5 associated with a number of selected variables from 
1 to 14. For each run, there were 4000 post burn-in iterations. 



5 Discussion 



Classical stochastic search variable selection methods often propose the use of the g-prior of Zell- 
ner. This prior can not be used if p > n, or if some variables are linear combinations of others. 
In particular, this last case can occur when several datasets with common covariates are merged. 
The prior for 0^ studied in this ma r iuscri pt is a possible alternative, and is a reparametrization 
of the prior of iGupta and IbrahimI 2007l | where r and A can be chosen independently. In this 
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run 16 = run 17 a run 18 
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Figure 8: Number of iterations of the runs 16,17 and 18 associated with a number of selected variables 
from 1 to 100. For each run, there were 4000 post burn-in iterations. 

case the parameter r does not influence the coefficient of the identity matrix. Using this prior, a 
way to jointly choose r and A was suggested and the results obtained on simulated data and on 
a real dataset were good and stable, whether some variables were linear combinations of others 
or not. Moreover, when r and A were chosen independently, the proposed method proved to be 
robust to the choices of these hyper-parameters, as shown in the sensitivity analysis. 

In practice, even if X!^X.y is theoretically invertible, some variables can be highly correlated 
and X!^X^ can be computationally singular. Moreover, we do not necessarily know if some vari- 
ables are linear combinations of others and to avoid a computational problem we suggest in any 
cases to use the prior and the algorithm proposed in this paper. Once a final selection of variables 
benoted by 7+ is obtained by our algorithm, the rank of the matrix with all the variables finally 
selected, denoted by should be computed. If this matrix is not of full rank, we can take a 
submatrix of of full rank as a new data matrix. Note that it is easier to take linearly in- 
dependent columns of than linearly independent columns of X, especially if p is quite large. 

We compared the results of the proposed SSVS method using a prior with a ridge parameter, 
with results obtained from the competing approach of the Bayesian Lasso. This last approach 
can also be used when some variables are linear combinations of others or if p > n. Moreover, 
its implementation is quite easy, and does not necessitate Metropolis-Hastings steps. Compared 
to the SSVS method proposed in l3.2|, an iteration is then less computing demanding. However, 
the Bayesian Lasso approach seems less practical than the SSVS approach. Indeed, in formula 
()16p it is X which is used and not a submatrix X^ like in the SSVS approach. The computing 
time to multiply or to invert matrices is then higher, and it could become an issue if p is very 
large. Besides, on the previous simulation and example, it appeared than more iterations are 
needed by the Bayesian Lasso compared to the SSVS approach (20000 vs 5000). Concerning the 
results, it appeared that the runs of the Bayesian Lasso were less stable than those of the SSVS 
approach, and that more "noise" was observed, since several non or less relevant variables were 
selected in only one of the ten runs. It is important to note that we adapted a simple Bayesian 
Lasso approach to probit mixed models, but many extensions of the classical Lasso exist and 
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can be adapted in Bavesian approaches, like the fused Lasso (ITibshirani et alJ [20051]), the group 
Lasso ( Yuan and Lin 2006l |) or the Elastic Net ( Zou and Hastie 20051 ] ) for instance. Recently, 



Kyung et al.l [20101 ] showed how to adapt this extensions in Bayesian approaches, and it could 



be interesting to compare them to the SSVS approach. 



Several extensions of the proposed SSVS method using a prior with a ridge parameter can be 
done in the future. First, in classical cases using the (7-prior, many authors suggested to put prior 
distributions on r, see Section [TJ Following them, an idea could be to put prior distributions on 
the hyper-paramete rs r and A. However, these authors often used Bayes Factors [see for instance 
Celeux et all boOfJ ] and not a latent 7 vector as done in this paper. They were then more in the 



spirit of model selection than in the spirit of variable selection. Finally, it would be interesting 
to have a non-supervised criterion to decide which variables should be in the final selection of a 
run. We suggested to represent the vector of the numbers of iterations during which variables 
have been selected by a boxplot, and to use a treshold to decide which variables should be in 
the final selection. However, we could develop a more formal criterion. 
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